Take-Home Exercise 2

Published

December 6, 2023

Modified

December 9, 2023

Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows

Motivation and Objective

This take-home exercise is motivated by two main reasons. Firstly, despite increasing amounts of open data available for public consumption, there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions.

Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.

Hence, your task for this take-home exercise is to conduct a case study to demonstrate the potential value of GDSA to integrate publicly available data from multiple sources for building a spatial interaction models to determine factors affecting urban mobility patterns of public bus transit.

The Data

Open Government Data

For the purpose of this assignment, data from several open government sources will be used:

  • Passenger Volume by Origin Destination Bus Stops, Bus Stop Location, Train Station and Train Station Exit Point, just to name a few of them, from LTA DataMall.

  • Master Plan 2019 Subzone Boundary, HDB Property Information, School Directory and Information and other relevant data from Data.gov.sg.

Specially collected data

  • Businesses, retail and services, leisure and recreation, etc geospatial data sets assemble by course instructor. (Refer to eLearn)

The Task

Geospatial Data Science

  • Derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the traffic analysis zone (TAZ).

  • With reference to the time intervals provided in the table below, construct an O-D matrix of commuter flows for a time interval of your choice by integrating Passenger Volume by Origin Destination Bus Stops and Bus Stop Location from LTA DataMall. The O-D matrix must be aggregated at the analytics hexagon level

    Peak hour period Bus tap on time
    Weekday morning peak 6am to 9am
    Weekday afternoon peak 5pm to 8pm
    Weekend/holiday morning peak 11am to 2pm
    Weekend/holiday evening peak 4pm to 7pm
  • Display the O-D flows of the passenger trips by using appropriate geovisualisation methods (not more than 5 maps).

  • Describe the spatial patterns revealed by the geovisualisation (not more than 100 words per visual).

  • Assemble at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.

  • Compute a distance matrix by using the analytical hexagon data derived earlier.

Spatial Interaction Modelling

  • Calibrate spatial interactive models to determine factors affecting urban commuting flows at the selected time interval.

  • Present the modelling results by using appropriate geovisualisation and graphical visualisation methods. (Not more than 5 visuals)

  • With reference to the Spatial Interaction Model output tables, maps and data visualisation prepared, describe the modelling results. (not more than 100 words per visual).

FIRST STEP

Load in necessary packages:

pacman::p_load(tmap, sf, sp, DT, stplanr, httr,
               performance, reshape2,
               ggpubr, tidyverse)

As we are aware, there is an increasing amounts of open data available, but there has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions. In this section I will be performing integration of various data sources.

I will first check the various database selected to see how is it possible to build a hollistic database.

Data Import, Extraction, Processing

Geospatial Data - Bus Stop

BusStop <- st_read(dsn="data/geospatial",
                  layer="BusStop")%>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

I see that BusStop is a point geometry SF with “BUS_STOP_N”, “BUS_ROOF_N”, “LOC_DESC” and its geometry.

MPSZ <- st_read(dsn="data/geospatial",                   
                layer="MPSZ-2019")%>%   
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

I see that MPSZ is a Multipolygon geometry SF with “SUBZONE_N”, “SUBZONE_C”, “PLN_AREA_N”, “PLN_AREA_C”, “REGION_N”, “REGION_C” and its geometry.

Creating Hexagon

Traffic Analysis Zone (TAZ)

Traffic analysis zones are universally used in travel demand modelling to represent the spatial distribution of trip1 origins and destinations, as well as the population, employment and other spatial attributes that generate or otherwise influence travel demand. The urban area is divided into a set of mutually exclusive and collectively exhaustive zones. While travel actually occurs from one point in the urban region to another, all trip origins and destinations in a travel demand model are represented at the spatially aggregate level of the movement from an origin zone to a destination zone. These movements are further aggregated within network assignment models as originating and ending at single points within the origin and destination zones – the zone centroids.

As we should derive an analytical hexagon data of 325m (this distance is the perpendicular distance between the centre of the hexagon and its edges) to represent the TAZ,

Per the documentation of st_make_grid:

cellsize is numeric of length 1 or 2 with target cellsize: ..for hexagonal cells the distance between opposite edges (edge length is cellsize/sqrt(3)). A length units object can be passed, or an area unit object with area size of the square or hexagonal cell.

In this case, the distance between opposite edges should be 325m * 2. And the length of hexagon should be 325m.

BusStop_hexagon_grid = st_make_grid(BusStop, 325, what = "polygons", square = FALSE)

BusStop_hexagon_sf = st_sf(geometry = BusStop_hexagon_grid) %>%
  # add grid ID
  mutate(grid_id = 1:length(lengths(BusStop_hexagon_grid)))

We will next use st_intersection() for point and polygon overlay, to combine the data sets. This will provide us with output in point sf object.

BusStop_hexagon <- st_intersection(BusStop_hexagon_sf, BusStop) %>%   
  select(1,2,4) %>%   
  st_drop_geometry()
MPSZ_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ)  %>%  
  select(1:3) 

BusStop_MPSZ_hexagon <- left_join(BusStop_hexagon,MPSZ_hexagon,
                                  by = "grid_id")

Note this BusStop_MPSZ_hexagon only contain the hexagon which has got bus stop within the area.

write_rds(BusStop_hexagon, "data/rds/BusStop_hexagon.rds")

Performing the Relational Join (to update attribute table of one geospatial with another aspatial data set)

Now we will next combine this onto our our odbs_peak data frame which shows the total number of trips from particular bus stop during peak hour.

Aspatial Data

Next we import the Aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION BUS STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.

OD_bus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
#non-spatial data with no geometry features
OD_bus$ORIGIN_PT_CODE <- as.factor(OD_bus$ORIGIN_PT_CODE)
OD_bus$DESTINATION_PT_CODE <- as.factor(OD_bus$DESTINATION_PT_CODE) 

Noted that the aspatial data for bus has total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.

For the purpose of this exercise, we only look at those Bus tap on time on Weekday morning peak, between 6am to 9am.

OD_bus_peak <- OD_bus %>% 
  filter(DAY_TYPE == "WEEKDAY" & 
           (TIME_PER_HOUR >=6 & TIME_PER_HOUR <=9))  
summary(OD_bus_peak)
  YEAR_MONTH          DAY_TYPE         TIME_PER_HOUR     PT_TYPE         
 Length:669211      Length:669211      Min.   :6.000   Length:669211     
 Class :character   Class :character   1st Qu.:7.000   Class :character  
 Mode  :character   Mode  :character   Median :8.000   Mode  :character  
                                       Mean   :7.595                     
                                       3rd Qu.:9.000                     
                                       Max.   :9.000                     
                                                                         
 ORIGIN_PT_CODE   DESTINATION_PT_CODE  TOTAL_TRIPS      
 22009  :  1925   22009  :  1803      Min.   :    1.00  
 84009  :  1817   52009  :  1785      1st Qu.:    2.00  
 75009  :  1757   84009  :  1732      Median :    7.00  
 52009  :  1678   75009  :  1658      Mean   :   39.49  
 59009  :  1656   59009  :  1534      3rd Qu.:   24.00  
 46009  :  1535   46009  :  1506      Max.   :29151.00  
 (Other):658843   (Other):659193                        

Computing the distance matrix

We use the spDists() coz faster.

BusStop_hexagon_sp <- as(BusStop_hexagon_sf, "Spatial")
dist <- spDists(BusStop_hexagon_sp,                  
                longlat = FALSE)  
#because or subzone is already svr21. Otherwise it will treat data as x and y, then calculate great circle distance.  
head(dist, n=c(10, 10)) #only list first 10 col and 10 rows
           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,]    0.0000  562.9165 1125.8330 1688.7495 2251.6660 2814.5826 3377.4991
 [2,]  562.9165    0.0000  562.9165 1125.8330 1688.7495 2251.6660 2814.5826
 [3,] 1125.8330  562.9165    0.0000  562.9165 1125.8330 1688.7495 2251.6660
 [4,] 1688.7495 1125.8330  562.9165    0.0000  562.9165 1125.8330 1688.7495
 [5,] 2251.6660 1688.7495 1125.8330  562.9165    0.0000  562.9165 1125.8330
 [6,] 2814.5826 2251.6660 1688.7495 1125.8330  562.9165    0.0000  562.9165
 [7,] 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330  562.9165    0.0000
 [8,] 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330  562.9165
 [9,] 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495 1125.8330
[10,] 5066.2486 4503.3321 3940.4156 3377.4991 2814.5826 2251.6660 1688.7495
           [,8]      [,9]     [,10]
 [1,] 3940.4156 4503.3321 5066.2486
 [2,] 3377.4991 3940.4156 4503.3321
 [3,] 2814.5826 3377.4991 3940.4156
 [4,] 2251.6660 2814.5826 3377.4991
 [5,] 1688.7495 2251.6660 2814.5826
 [6,] 1125.8330 1688.7495 2251.6660
 [7,]  562.9165 1125.8330 1688.7495
 [8,]    0.0000  562.9165 1125.8330
 [9,]  562.9165    0.0000  562.9165
[10,] 1125.8330  562.9165    0.0000

Notice that the output dist is a matrix object class of R. Also notice that the column heanders and row headers are not labeled with the planning subzone codes.

Labelling column and row headers of a distance matrix

First, we will create a list sorted according to the the distance matrix by planning sub-zone code.

grid <- BusStop_hexagon_sf$grid_id

Next we will attach SUBZONE_C to row and column for distance matrix matching ahead

colnames(dist) <- paste0(grid) 
rownames(dist) <- paste0(grid)

Pivoting distance value by SUBZONE_C

Next, we will pivot the distance matrix into a long table by using the row and column subzone codes as show in the code chunk below.

distPair <- melt(dist) %>% 
  rename(dist = value)
head(distPair, 10)
   Var1 Var2      dist
1     1    1    0.0000
2     2    1  562.9165
3     3    1 1125.8330
4     4    1 1688.7495
5     5    1 2251.6660
6     6    1 2814.5826
7     7    1 3377.4991
8     8    1 3940.4156
9     9    1 4503.3321
10   10    1 5066.2486

Note that melt() is a old reshape tool function, that take dist matrix and convert it to long table, 1. origin, 2. destination, 3. distance matrix

Notice that the within zone distance is 0.

Updating intra-zonal distances

Now I am going to append a constant value to replace the intra-zonal distance of 0, first select and find out the minimum value of the distance by using summary().

distPair <- distPair %>%
  rename(orig = Var1,
         dest = Var2)
distPair %>%   
  filter(dist > 0) %>%   
  summary()
      orig            dest            dist      
 Min.   :    1   Min.   :    1   Min.   :  325  
 1st Qu.: 3301   1st Qu.: 3301   1st Qu.:11500  
 Median : 6600   Median : 6600   Median :18086  
 Mean   : 6600   Mean   : 6600   Mean   :18991  
 3rd Qu.: 9900   3rd Qu.: 9900   3rd Qu.:25526  
 Max.   :13200   Max.   :13200   Max.   :51798  
write_rds(distPair, "data/rds/distPair.rds") 

Preparing flow data

Note that although I would want to keep all the fields intact as I wasnt sure what are the data I need for the next steps. It is taking too much processing power and space, so i will summarize and aggregate the value of selected time.

BusStop_Trips <- left_join(OD_bus_peak, BusStop_hexagon, #the left join is so to get grid ID from hexagon file
                      by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  drop_na() %>%
  rename(ORIGIN_BS = ORIGIN_PT_CODE,
         ORIGIN_GRID = grid_id,
         DESTIN_BS = DESTINATION_PT_CODE
         )%>%
  group_by(ORIGIN_GRID, ORIGIN_BS, DESTIN_BS) %>% 
  summarize(TRIPS = sum(TOTAL_TRIPS))

But there isn’t grid ID for my destination so i will now do the following to get complete picture:

BusStop_Trips <- BusStop_Trips %>%
  left_join(BusStop_hexagon, #the left join is so to get grid ID from hexagon file
                      by = c("DESTIN_BS" = "BUS_STOP_N")) %>%
  drop_na() %>%
    rename(DESTIN_GRID = grid_id)%>%
  select(1,2,5,3,4)
BusStop_Trips <- unique(BusStop_Trips)

Visualising Spatial Interaction

Separating intra-flow from passenger volume df

To add three new fields in BusStop_Trips dataframe, that is to be used for flow.

str(BusStop_Trips)
gropd_df [236,683 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ ORIGIN_GRID: int [1:236683] 52 52 52 52 52 52 149 149 149 200 ...
 $ ORIGIN_BS  : chr [1:236683] "25059" "25059" "25059" "25059" ...
 $ DESTIN_GRID: int [1:236683] 783 345 295 580 630 735 783 345 735 2079 ...
 $ DESTIN_BS  : chr [1:236683] "24719" "25709" "25719" "25771" ...
 $ TRIPS      : num [1:236683] 54 2 1 2 1 2 34 13 3 2 ...
 - attr(*, "groups")= tibble [4,962 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ ORIGIN_GRID: int [1:4962] 52 149 200 201 245 295 295 295 298 342 ...
  ..$ ORIGIN_BS  : chr [1:4962] "25059" "25751" "26379" "26369" ...
  ..$ .rows      : list<int> [1:4962] 
  .. ..$ : int [1:6] 1 2 3 4 5 6
  .. ..$ : int [1:3] 7 8 9
  .. ..$ : int [1:5] 10 11 12 13 14
  .. ..$ : int [1:8] 15 16 17 18 19 20 21 22
  .. ..$ : int [1:9] 23 24 25 26 27 28 29 30 31
  .. ..$ : int [1:8] 32 33 34 35 36 37 38 39
  .. ..$ : int [1:10] 40 41 42 43 44 45 46 47 48 49
  .. ..$ : int [1:14] 50 51 52 53 54 55 56 57 58 59 ...
  .. ..$ : int [1:10] 64 65 66 67 68 69 70 71 72 73
  .. ..$ : int [1:7] 74 75 76 77 78 79 80
  .. ..$ : int [1:6] 81 82 83 84 85 86
  .. ..$ : int [1:7] 87 88 89 90 91 92 93
  .. ..$ : int [1:11] 94 95 96 97 98 99 100 101 102 103 ...
  .. ..$ : int [1:16] 105 106 107 108 109 110 111 112 113 114 ...
  .. ..$ : int [1:7] 121 122 123 124 125 126 127
  .. ..$ : int [1:12] 128 129 130 131 132 133 134 135 136 137 ...
  .. ..$ : int [1:17] 140 141 142 143 144 145 146 147 148 149 ...
  .. ..$ : int [1:8] 157 158 159 160 161 162 163 164
  .. ..$ : int [1:12] 165 166 167 168 169 170 171 172 173 174 ...
  .. ..$ : int [1:8] 177 178 179 180 181 182 183 184
  .. ..$ : int [1:14] 185 186 187 188 189 190 191 192 193 194 ...
  .. ..$ : int [1:9] 199 200 201 202 203 204 205 206 207
  .. ..$ : int [1:4] 208 209 210 211
  .. ..$ : int [1:9] 212 213 214 215 216 217 218 219 220
  .. ..$ : int [1:7] 221 222 223 224 225 226 227
  .. ..$ : int [1:5] 228 229 230 231 232
  .. ..$ : int [1:7] 233 234 235 236 237 238 239
  .. ..$ : int [1:7] 240 241 242 243 244 245 246
  .. ..$ : int 247
  .. ..$ : int [1:4] 248 249 250 251
  .. ..$ : int [1:5] 252 253 254 255 256
  .. ..$ : int [1:5] 257 258 259 260 261
  .. ..$ : int [1:20] 262 263 264 265 266 267 268 269 270 271 ...
  .. ..$ : int [1:4] 282 283 284 285
  .. ..$ : int [1:13] 286 287 288 289 290 291 292 293 294 295 ...
  .. ..$ : int [1:4] 299 300 301 302
  .. ..$ : int [1:7] 303 304 305 306 307 308 309
  .. ..$ : int [1:8] 310 311 312 313 314 315 316 317
  .. ..$ : int [1:8] 318 319 320 321 322 323 324 325
  .. ..$ : int [1:4] 326 327 328 329
  .. ..$ : int [1:14] 330 331 332 333 334 335 336 337 338 339 ...
  .. ..$ : int [1:6] 344 345 346 347 348 349
  .. ..$ : int [1:6] 350 351 352 353 354 355
  .. ..$ : int [1:5] 356 357 358 359 360
  .. ..$ : int [1:10] 361 362 363 364 365 366 367 368 369 370
  .. ..$ : int [1:6] 371 372 373 374 375 376
  .. ..$ : int [1:6] 377 378 379 380 381 382
  .. ..$ : int [1:4] 383 384 385 386
  .. ..$ : int [1:16] 387 388 389 390 391 392 393 394 395 396 ...
  .. ..$ : int [1:5] 403 404 405 406 407
  .. ..$ : int [1:9] 408 409 410 411 412 413 414 415 416
  .. ..$ : int [1:7] 417 418 419 420 421 422 423
  .. ..$ : int [1:5] 424 425 426 427 428
  .. ..$ : int [1:3] 429 430 431
  .. ..$ : int [1:6] 432 433 434 435 436 437
  .. ..$ : int [1:8] 438 439 440 441 442 443 444 445
  .. ..$ : int [1:2] 446 447
  .. ..$ : int [1:9] 448 449 450 451 452 453 454 455 456
  .. ..$ : int [1:3] 457 458 459
  .. ..$ : int [1:4] 460 461 462 463
  .. ..$ : int [1:4] 464 465 466 467
  .. ..$ : int [1:5] 468 469 470 471 472
  .. ..$ : int [1:4] 473 474 475 476
  .. ..$ : int [1:32] 477 478 479 480 481 482 483 484 485 486 ...
  .. ..$ : int [1:5] 509 510 511 512 513
  .. ..$ : int [1:42] 514 515 516 517 518 519 520 521 522 523 ...
  .. ..$ : int [1:8] 556 557 558 559 560 561 562 563
  .. ..$ : int [1:7] 564 565 566 567 568 569 570
  .. ..$ : int [1:5] 571 572 573 574 575
  .. ..$ : int [1:5] 576 577 578 579 580
  .. ..$ : int [1:3] 581 582 583
  .. ..$ : int [1:3] 584 585 586
  .. ..$ : int [1:17] 587 588 589 590 591 592 593 594 595 596 ...
  .. ..$ : int [1:3] 604 605 606
  .. ..$ : int [1:106] 607 608 609 610 611 612 613 614 615 616 ...
  .. ..$ : int [1:7] 713 714 715 716 717 718 719
  .. ..$ : int [1:12] 720 721 722 723 724 725 726 727 728 729 ...
  .. ..$ : int 732
  .. ..$ : int [1:3] 733 734 735
  .. ..$ : int [1:6] 736 737 738 739 740 741
  .. ..$ : int [1:3] 742 743 744
  .. ..$ : int [1:25] 745 746 747 748 749 750 751 752 753 754 ...
  .. ..$ : int [1:9] 770 771 772 773 774 775 776 777 778
  .. ..$ : int [1:34] 779 780 781 782 783 784 785 786 787 788 ...
  .. ..$ : int [1:10] 813 814 815 816 817 818 819 820 821 822
  .. ..$ : int 823
  .. ..$ : int [1:41] 824 825 826 827 828 829 830 831 832 833 ...
  .. ..$ : int [1:4] 865 866 867 868
  .. ..$ : int [1:11] 869 870 871 872 873 874 875 876 877 878 ...
  .. ..$ : int [1:5] 880 881 882 883 884
  .. ..$ : int [1:2] 885 886
  .. ..$ : int [1:2] 887 888
  .. ..$ : int [1:9] 889 890 891 892 893 894 895 896 897
  .. ..$ : int [1:4] 898 899 900 901
  .. ..$ : int 902
  .. ..$ : int [1:3] 903 904 905
  .. ..$ : int [1:10] 906 907 908 909 910 911 912 913 914 915
  .. ..$ : int [1:3] 916 917 918
  .. ..$ : int [1:6] 919 920 921 922 923 924
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

This above checks show that my ORIGIN_BS column is chr, therefore i will now convert it to factor:

BusStop_Trips$ORIGIN_BS <- as.factor(BusStop_Trips$ORIGIN_BS)
BusStop_Trips$FlowNoIntra <- ifelse(
  BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID, 
  0, BusStop_Trips$TRIPS)
BusStop_Trips$offset <- ifelse(
  BusStop_Trips$ORIGIN_GRID == BusStop_Trips$DESTIN_GRID, 
  0.000001, 1)

Combining passenger volume data with distance value

BusStop_Trips1 <- BusStop_Trips %>%
  left_join (distPair,
             by = c("ORIGIN_GRID" = "orig",
                    "DESTIN_GRID" = "dest"))
summary(BusStop_Trips1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:236683     
 1st Qu.: 5846   84009  :   564   1st Qu.: 5956   Class :character  
 Median : 7637   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7473   52009  :   538   Mean   : 7457                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   444   Max.   :13117                     
                 (Other):233556                                     
     TRIPS          FlowNoIntra          offset              dist      
 Min.   :    1.0   Min.   :    0.0   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1.000000   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1.000000   Median : 3748  
 Mean   :  109.5   Mean   :  109.4   Mean   :0.995475   Mean   : 4865  
 3rd Qu.:   58.0   3rd Qu.:   58.0   3rd Qu.:1.000000   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1.000000   Max.   :24758  
                                                                       

From the summary above, it is noted that there are some origin and destination bus stops within the same hexagon grid. For the purpose of this exercise we will exclude them, as the aim is to find out the interozonal (hexagon) flow.

BusStop_Trips2 <- BusStop_Trips1 %>%
  filter(dist>0)
summary(BusStop_Trips2)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:235612     
 1st Qu.: 5866   84009  :   563   1st Qu.: 5956   Class :character  
 Median : 7639   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7475   52009  :   538   Mean   : 7459                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   443   Max.   :13117                     
                 (Other):232487                                     
     TRIPS          FlowNoIntra          offset       dist      
 Min.   :    1.0   Min.   :    1.0   Min.   :1   Min.   :  325  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1   Median : 3748  
 Mean   :  109.9   Mean   :  109.9   Mean   :1   Mean   : 4887  
 3rd Qu.:   59.0   3rd Qu.:   59.0   3rd Qu.:1   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1   Max.   :24758  
                                                                

Now we see the minimum is 325, which is the the perpendicular distance between the centre of the hexagon and its edges (next hexagon).

Visualization for the TOTAL TRIPS taken at Origin and Destination hexagon area

origin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
                                by =c("grid_id" = "ORIGIN_GRID")) %>%
  drop_na() %>%
  rename(ORIGIN_GRID = grid_id) %>%
  group_by(ORIGIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))

destin_density_map <- left_join(BusStop_hexagon_sf, BusStop_Trips1,
                                by =c("grid_id" = "DESTIN_GRID")) %>%
  drop_na() %>%
  rename(DESTIN_GRID = grid_id) %>%
  group_by(DESTIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS), AVERAGE_DIST = weighted.mean(dist, w = TRIPS))

We will just narrow down to look at number of trips above 100k during this period.

tmap_mode("view")
#tmap_options(check.and.fix = TRUE)
map_honeycomb = tm_shape(origin_density_map %>%
                         filter(TOTAL_TRIPS>100000)) +
  tm_fill(
    col = "TOTAL_TRIPS",
    palette = "Reds",
    style = "cont",
    title = "Number of Trips by Origin Bus Stop within the Area",
    alpha = 0.4,
    popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
    popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

map_honeycomb1 = tm_shape(destin_density_map %>%
                         filter(TOTAL_TRIPS>100000))+ 
  tm_fill("TOTAL_TRIPS", 
          style = "cont", 
          palette = "Reds",
          title = "Number of Trips by Destination Bus Stop within the Area",
          alpha = 0.4,
          popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
          popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

tmap_arrange(map_honeycomb, map_honeycomb1, asp=2, ncol=2)

Insights

It is quite surprising to see that there isn’t much alight activity amongst the bus stataion in CBD area, given that this is Weekday morning peak, between 6am to 9am. However, having another thought, this timing is probably more relevant to students going to school (since classes start early from 7 to 8am). Likely the CBD crowd only kicks in from 8-9am, which is significantly lesser in comparison to the whole dataset. We can dive down further (segregate the timing e.g. do one 6-8, one 7-9, to see any difference in results) as the next project.

An interesting sight is that grid_id 5700 has high number of BOTH incoming and outgoing traffic, as reflected by the 301.699 total number of trips as origin area, and 422,497 total number of trips as destination area. This area is likely to be either a bus interchange, or a popular bus stop.

BusStop_Trips1[BusStop_Trips1$ORIGIN_GRID == 5700,]
# A tibble: 529 × 8
# Groups:   ORIGIN_GRID, ORIGIN_BS [3]
   ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS TRIPS FlowNoIntra offset   dist
         <int> <fct>           <int> <chr>     <dbl>       <dbl>  <dbl>  <dbl>
 1        5700 46009            7833 01019        70          70      1 17120.
 2        5700 46009            7930 01059       175         175      1 16759.
 3        5700 46009            7881 01119       190         190      1 16937.
 4        5700 46009            7880 02049        21          21      1 17444.
 5        5700 46009            8024 02089       110         110      1 17910.
 6        5700 46009            8120 02099         1           1      1 18057.
 7        5700 46009            7786 07539        99          99      1 16289.
 8        5700 46009            7882 07589        55          55      1 16434.
 9        5700 46009            6675 14009       183         183      1 19164.
10        5700 46009            6627 14139        16          16      1 18858.
# ℹ 519 more rows

We noted the bus stop is 46009

BusStop[BusStop$BUS_STOP_N == 46009,]
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22836.28 ymin: 46567.63 xmax: 22836.28 ymax: 46567.63
Projected CRS: SVY21 / Singapore TM
     BUS_STOP_N BUS_ROOF_N          LOC_DESC                  geometry
2178      46009        INT WOODLANDS REG INT POINT (22836.28 46567.63)

From the above, we found out that this bus stop is a Interchange station in Woodlands, a densely populated area.

Similarly, we have also noted a few grids where BOTH incoming and outgoing traffic are relatively high. To draw a similar analysis:

list <- intersect(origin_density_map$ORIGIN_GRID[origin_density_map$TOTAL_TRIPS>100000]
        , destin_density_map$DESTIN_GRID[destin_density_map$TOTAL_TRIPS>100000])

So we know that the grid with >100k BOTH incoming and outgoing traffic are:

list
 [1]  2993  4250  4908  5700  7648  7703  8467  9532 10286 10772

Next we find out where are these grid:

BusStop_hexagon[BusStop_hexagon$grid_id %in% list, "LOC_DESC"]
 [1] "TAMPINES INT"                "SERANGOON INT"              
 [3] "TAMPINES INT"                "S'GOON STN EXIT A / BLK 413"
 [5] "WOODLANDS REG INT"           "BOON LAY INT"               
 [7] "S'GOON STN EXIT C / BLK 201" "BEF PUNGGOL RD"             
 [9] "CLEMENTI STN"                "CHOA CHU KANG INT"          
[11] "CLEMENTI STN"                "S'GOON STN EXIT E"          
[13] "LOT 1 / CHOA CHU KANG STN"   "AFT ANG MO KIO STN EXIT A"  
[15] "AFT PUNGGOL RD"              "TOA PAYOH BUS INT"          
[17] "TOA PAYOH BUS INT"           "BEDOK BUS INT"              
[19] "BEDOK BUS INT"               "S'GOON STN EXIT H"          
[21] "ANG MO KIO INT"              "Aft Tampines Stn Exit E"    
[23] "ANG MO KIO STN"              "W'LANDS STN EXIT 4"         
[25] "W'LANDS STN EXIT 5"          "Tampines Stn Exit D"        
[27] "OPP CHOA CHU KANG STN"       "BEDOK STN EXIT A"           

Insights

Noted these are mainly interchange or MRT station so probably make sense as people may change to bus / train at these spot, and there may also be larger number of buses available in these station.

Visualization for the Average distance (weighted) taken at Origin and Destination hexagon area

Now we look at the total average weighted distance traveled.

summary(origin_density_map)
  ORIGIN_GRID     TOTAL_TRIPS      AVERAGE_DIST            geometry   
 Min.   :   52   Min.   :     1   Min.   :  325   POLYGON      :2451  
 1st Qu.: 4688   1st Qu.:   775   1st Qu.: 1517   epsg:3414    :   0  
 Median : 7091   Median :  4603   Median : 2188   +proj=tmer...:   0  
 Mean   : 6762   Mean   : 10579   Mean   : 2426                       
 3rd Qu.: 8908   3rd Qu.: 13280   3rd Qu.: 3051                       
 Max.   :13117   Max.   :365868   Max.   :14379                       
summary(destin_density_map)
  DESTIN_GRID     TOTAL_TRIPS      AVERAGE_DIST            geometry   
 Min.   :   52   Min.   :     1   Min.   :  325   POLYGON      :2456  
 1st Qu.: 4685   1st Qu.:  1543   1st Qu.: 1418   epsg:3414    :   0  
 Median : 7089   Median :  4246   Median : 2099   +proj=tmer...:   0  
 Mean   : 6759   Mean   : 10557   Mean   : 2435                       
 3rd Qu.: 8906   3rd Qu.: 10514   3rd Qu.: 3062                       
 Max.   :13117   Max.   :422497   Max.   :13993                       

We see the mean is around 2500 average distance, and max of around 14,000 distance. As a gauge we will look at data above 8,000 (8km) average distance:

tmap_mode("view")
#tmap_options(check.and.fix = TRUE)
map_honeycomb2 = tm_shape(origin_density_map %>%
                         filter(AVERAGE_DIST>8000)) +
  tm_fill(
    col = "AVERAGE_DIST",
    palette = "Reds",
    style = "cont",
    title = "Average Distance by Origin Bus Stop within the Area",
    alpha = 0.4,
    popup.vars = c("Average DISTANCE travelled from this origin bus stop: " = "AVERAGE_DIST"),
    popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

map_honeycomb3 = tm_shape(destin_density_map %>%
                         filter(AVERAGE_DIST>8000)) +
  tm_fill("AVERAGE_DIST", 
          style = "cont", 
          palette = "Reds",
          title = "Number of Trips by Destination Bus Stop within the Area",
          alpha = 0.4,
          popup.vars = c("Average DISTANCE travelled to this destination bus stop: " = "AVERAGE_DIST"),
          popup.format = list(AVERAGE_DIST = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

tmap_arrange(map_honeycomb2, map_honeycomb3, asp=2, ncol=2)
#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE ORIGIN BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% origin_density_map$ORIGIN_GRID[origin_density_map$AVERAGE_DIST>8000], "LOC_DESC"]
 [1] "CHANGI AIRPORT PTB1"     "AFT CHANGI AIRPORT PTB2"
 [3] "AFT LIM CHU KANG LANE 8" "CHANGI AIRPORT PTB3"    
 [5] "CHANGI AIRPORT PTB2"     "PROLOGIS"               
 [7] "DHL"                     "BEF CP E1"              
 [9] "YISHUN DAIRY FARM"       "BEF CP D5"              
[11] "SIGLAP LK"               "Aft Mandai Rd"          
[13] "POLICE COAST GUARD"      "BLK 402"                
[15] "AFT CHANGI FERRY RD"     "OPP CASABLANCA"         
[17] "BEF CHANGI FERRY RD"     "EXPEDITORS"             
[19] "OPP MANDARIN GDNS"       "BAYFRONT STN EXIT B/MBS"
#FOR THOSE >8km AVERAGE TRAVEL DISTANCE WITH THESE DESTINATION BUS STOP
BusStop_hexagon[BusStop_hexagon$grid_id %in% destin_density_map$DESTIN_GRID[destin_density_map$AVERAGE_DIST>8000], "LOC_DESC"]
 [1] "ST ANNE'S CH"              "AIRPORT POLICE STN"       
 [3] "AFT DAIRY FARM RD"         "DHOBY GHAUT STN"          
 [5] "CHANGI AIRPORT PTB1"       "THE SAIL"                 
 [7] "DOWNTOWN STN"              "MARINA BAY FINANCIAL CTR" 
 [9] "AFT LIM CHU KANG LANE 8"   "AFT SLE"                  
[11] "CHANGI AIRPORT PTB3"       "CHANGI AIRPORT PTB2"      
[13] "PROLOGIS"                  "DHL"                      
[15] "TANJONG PAGAR STN EXIT C"  "BEF CHANGI AIRPORT PTB3"  
[17] "MAPLETREE ANSON"           "POLICE COAST GUARD"       
[19] "CHANGI NAVAL BASE"         "OPP CHANGI NAVAL BASE"    
[21] "BEF STEVENS STN EXIT 4"    "AFT STEVENS STN EXIT 5"   
[23] "AFT CHANGI FERRY RD"       "CASABLANCA"               
[25] "LP 94"                     "ORCHARD STN / TANGS PLAZA"
[27] "Marina Bay Stn"            "BEF YISHUN AVE 1"         
[29] "BEF CHANGI FERRY RD"       "OPP DB SCHENKER"          
[31] "OPP ORCHARD STN / ION"     "YMCA"                     
[33] "AFT DHOBY GHAUT STN"       "MENLO WORLDWIDE"          
[35] "EXPEDITORS"                "DHOBY GHAUT STN EXIT B"   
[37] "NEAR SATS FLIGHT KITCHEN" 

Insights

Another interesting observation. For this we will just omit bus stop to and from changi airport, as it can be understood that bus to and from airport are typically of longer distance. The same applies to Changi Naval base/ Opp Changi Naval base as the bus stop is probably serviced by a long distance bus.

The origin bus stops that meet this criteria of >8km average distance are mainly from woodlands/ yishun / causeway. There are also people taking long distance bus and alight at central/ MBFC /Mapletree / CBD /Seletar / Town area, mainly for work.

Visualization of OD FLOW

Creating desire lines

Read the documentation on od2line here

simpleBS_hexagon <- left_join(BusStop_Trips, BusStop_hexagon_sf,
                                 by = c("ORIGIN_GRID" = "grid_id")) %>%
  drop_na() %>%
  group_by(ORIGIN_GRID) %>%
  select(8,1) 

simpleBS_hexagon <- unique(simpleBS_hexagon)
simpleBS_hexagon <- st_sf(geometry = simpleBS_hexagon)
BS_flowLine <- BusStop_Trips1 %>%
  drop_na()  %>%
  group_by(ORIGIN_GRID, DESTIN_GRID) %>% 
  summarize(TOTAL_TRIPS = sum(TRIPS)) %>% 
  filter(TOTAL_TRIPS>1000)
BS_flowLine <- unique(BS_flowLine)
flowLine <- od2line(flow = BS_flowLine,                     
                    origin_code = "ORIGIN_GRID",                     
                    dest_code = "DESTIN_GRID",                     
                    zones = BusStop_hexagon_sf,                     
                    zone_code = "grid_id") 

Visualising the desire lines

To dive down, we will look at OD visualizatiion for total trips >1,000 during the timeframe.

tmap_options(check.and.fix = TRUE)
tm_shape(MPSZ) +   
  tm_polygons() + tm_fill(alpha = 0.1) +
tm_shape(simpleBS_hexagon) +
  tm_polygons() +
flowLine %>%  
tm_shape() +
  tm_lines(lwd = "TOTAL_TRIPS",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.3)

It appears that the highest number of long OD flow is between north and east, ALso noted that the dense activity areas, per the above analysis where we noted the high total trips per origin/destination, are also closley linked as one of the OD flow. These areas are typically the interchange or major stops.

Next we will look at the total trips >10,000 during the specific period.

tm_shape(MPSZ) +   
  tm_polygons() + 
tm_shape(simpleBS_hexagon) +
  tm_polygons() +
flowLine %>%  
  filter(TOTAL_TRIPS>10000)%>%
tm_shape() +
  tm_lines(lwd = "TOTAL_TRIPS",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.8)

This is quite interesting, as we see the most obvious OD flow is between the woodland checkpoint to woodland interchange / mrt station area. It is also interesting to see there is long distance OD from woodlands to the major Punggol/seletar area.

Also as we are looking at morning peak, likely it is the secondary school student flow, going to school.

tmap_mode("plot")

Assembling VARIABLES

Next moving on to the assembling at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources. The purpose is to find out what causes the OD flow in certain area, and why is certain areas higher flow than another.

We will look at the following:

  1. Business, entertn, F&B, FinServ, Leisure&Recreation and Retails are geospatial data sets of the locations of business establishments, entertainments, food and beverage outlets, financial centres, leisure and recreation centres, retail and services stores/outlets.

  2. HDB: This data set is the geocoded version of HDB Property Information data from data.gov. The data set is prepared using September 2021 data.

business <- st_read(dsn="data/geospatial",
                  layer="Business")%>%
  st_transform(crs = 3414)
Reading layer `Business' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 6550 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
Projected CRS: SVY21 / Singapore TM
entertn <- st_read(dsn="data/geospatial",
                  layer="entertn") %>%
  st_transform(crs = 3414)
Reading layer `entertn' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 114 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 10809.34 ymin: 26528.63 xmax: 41600.62 ymax: 46375.77
Projected CRS: SVY21 / Singapore TM
fnb <- st_read(dsn="data/geospatial",
                  layer="F&B") %>%
  st_transform(crs = 3414)
Reading layer `F&B' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1919 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6010.495 ymin: 25343.27 xmax: 45462.43 ymax: 48796.21
Projected CRS: SVY21 / Singapore TM
retail <- st_read(dsn="data/geospatial",
                  layer="Retails") %>%
  st_transform(crs = 3414)
Reading layer `Retails' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 37635 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4737.982 ymin: 25171.88 xmax: 48265.04 ymax: 50135.28
Projected CRS: SVY21 / Singapore TM
lnr <- st_read(dsn="data/geospatial",
                  layer="Liesure&Recreation") %>%
  st_transform(crs = 3414)
Reading layer `Liesure&Recreation' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
fs <- st_read(dsn="data/geospatial",
                  layer="FinServ") %>%
  st_transform(crs = 3414)
Reading layer `FinServ' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3320 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
Projected CRS: SVY21 / Singapore TM

Data wraggling (into hexagon analytical area)

hdb <- read_csv("data/aspatial/hdb.csv")
pop <- read_csv("data/aspatial/pop.csv")

Considering the flow of the student in morning peak likely be in secondary / JC

We will focus on age 13 to 24 therefore will remove the other data set, and remove those zero values.

pop_13 <- pop %>%
  select(1:2,4) %>%
  filter(AGE13_24>0)

This will gives us the PA, SZ, and Total number of Age7_12.

Next we look into the secondary school

school <- read_csv("data/aspatial/Generalinformationofschools.csv")

Note we dont need so many different columns, and we only want to look at secondary / JC:

school_13 <- school %>%
  filter(mainlevel_code != "PRIMARY") %>%
  select(1,3:4,10:11,19:24)
write_rds(school_13, "data/rds/nonPrimarySchool.csv") 

next we will look into the postal code of the school_13 so we can obtain the geometry of this data:

school_13_geo <- left_join(school_13, hdb, 
                 by =c("postal_code" = "postal"))

From the dataset we can see that the postal code not able to find matching postal in hdb data set.

So we use SLA geoservices (finally told of in Class 4… after struggling for 3 endless night) to find the x y.

url<- "https://www.onemap.gov.sg/api/common/elastic/search"

csv <- read_csv("data/aspatial/GeneralInformationofSchools.csv")
postcodes <- csv$'postal_code'

found <- data.frame()
not_found <- data.frame()

for (postcode in postcodes){
  query<- list("searchVal" = postcode, "returnGeom"='Y', 'getAddrDetails'='Y',"pagenum"='1')
  res <-GET(url, query=query)
  
  if((content(res)$found)!=0){
    found<-rbind(found, data.frame(content(res))[4:13])
  }else{
    not_found = data.frame(postcode)
  }
  
}

Because we noted some coordinate, we will manually insert the coordinates to manually fill in the gap.

merged = merge(csv, found, by.x='postal_code', by.y= 'results.POSTAL', all = TRUE)
write_csv(merged, file="data/aspatial/schools.csv")
write_csv(not_found, file ="data/aspatial/not_found.csv")

We manually add in the missing coords and saved it as schools_.csb

schools <- read_csv("data/aspatial/schools_.csv")
schools <- schools %>%
  rename(latitude = "results.LATITUDE",
         longitude = "results.LONGITUDE") %>%
  select(postal_code,school_name, mainlevel_code, latitude, longitude)
schools_sf <- st_as_sf(schools,
                       coords = c("longitude","latitude"),
                       crs = 4326) %>%
  st_transform(crs=3414)

This helps to populate school in planning subzone:

tmap_options(check.and.fix = TRUE)
tm_shape(MPSZ) +
  tm_polygons() +
tm_shape(business) + 
  tm_dots()

BusStop_Trips3 <- left_join(BusStop_Trips1,BusStop_hexagon_sf,
                            by =c("DESTIN_GRID"="grid_id")) %>%
  rename(DESTIN_GEO = geometry)
BusStop_Trips3 = st_sf(geometry = BusStop_Trips3)

We add the various counts of the attarctiveness variables into the df:

BusStop_Trips3$'SCHOOL_COUNT'<- lengths(
  st_intersects(
    BusStop_Trips3, schools_sf
  )
)
BusStop_Trips3$'BUSINESS_COUNT'<- lengths(
  st_intersects(
    BusStop_Trips3, business
  )
)
BusStop_Trips3$'ENTERTN_COUNT'<- lengths(
  st_intersects(
    BusStop_Trips3, entertn
  )
)
BusStop_Trips3$'FNB_COUNT'<- lengths(
  st_intersects(
    BusStop_Trips3, fnb
  )
)
BusStop_Trips3$'FS_COUNT'<- lengths(
  st_intersects(
    BusStop_Trips3, fs
  )
)
BusStop_hexagon_sf$'SCHOOL_COUNT'<- lengths(
  st_intersects(
    BusStop_hexagon_sf, schools_sf
  )
)

We can also extract geo data for those aged 13 to 24. And check if this is one of the contributing factor to OD flow in morning peak.

schools_13_geo <- st_intersection(BusStop_hexagon_sf, schools_sf) %>%
  filter(mainlevel_code != "PRIMARY") 

We will use this later where we map this grid id to the flow data OD of the destin grid.

Now we go find the flow data TO FIND THE ATTRACTIVENESS AT THE DESTINATION!

summary(BusStop_Trips3)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:236683     
 1st Qu.: 5846   84009  :   564   1st Qu.: 5956   Class :character  
 Median : 7637   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7473   52009  :   538   Mean   : 7457                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   444   Max.   :13117                     
                 (Other):233556                                     
     TRIPS          FlowNoIntra          offset              dist      
 Min.   :    1.0   Min.   :    0.0   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1.000000   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1.000000   Median : 3748  
 Mean   :  109.5   Mean   :  109.4   Mean   :0.995475   Mean   : 4865  
 3rd Qu.:   58.0   3rd Qu.:   58.0   3rd Qu.:1.000000   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1.000000   Max.   :24758  
                                                                       
         DESTIN_GEO      SCHOOL_COUNT     BUSINESS_COUNT   ENTERTN_COUNT    
 POLYGON      :236683   Min.   :0.00000   Min.   : 0.000   Min.   :0.00000  
 epsg:3414    :     0   1st Qu.:0.00000   1st Qu.: 0.000   1st Qu.:0.00000  
 +proj=tmer...:     0   Median :0.00000   Median : 0.000   Median :0.00000  
                        Mean   :0.09973   Mean   : 1.445   Mean   :0.09812  
                        3rd Qu.:0.00000   3rd Qu.: 1.000   3rd Qu.:0.00000  
                        Max.   :4.00000   Max.   :40.000   Max.   :8.00000  
                                                                            
   FNB_COUNT         FS_COUNT     
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 0.000   Median : 0.000  
 Mean   : 1.521   Mean   : 2.729  
 3rd Qu.: 0.000   3rd Qu.: 3.000  
 Max.   :96.000   Max.   :68.000  
                                  
BusStop_Trips3 <- unique(BusStop_Trips3)
write_rds(BusStop_Trips3, "data/rds/SIM_data_attrc")

Calibrating Spatial Interaction Models

Visualising the dependent variable (TRIPS)

ggplot(data = BusStop_Trips3,
       aes(x = TRIPS)) +
  geom_histogram()

Notice that the distribution is highly skewed and not resemble bell shape or also known as normal distribution.

Next, let us visualise the relation between the dependent variable and one of the key independent variable in Spatial Interaction Model, namely distance.

ggplot(data = BusStop_Trips3,        
       aes(x = dist,            
           y = TRIPS)) +   
  geom_point() +   
  geom_smooth(method = lm)

Notice that their relationship hardly resemble linear relationship.

On the other hand, if we plot the scatter plot by using the log transformed version of both variables, we can see that their relationship is more resemble linear relationship.

ggplot(data = BusStop_Trips3,        
       aes(x = log(dist),            
           y = log(TRIPS))) +   
  geom_point() +   
  geom_smooth(method = lm)

Checking for variables with zero values

Since Poisson Regression is based of log and log 0 is undefined, it is important for us to ensure that no 0 values in the explanatory variables.

In the code chunk below, summary() of Base R is used to compute the summary statistics of all variables:

summary(BusStop_Trips3)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:236683     
 1st Qu.: 5846   84009  :   564   1st Qu.: 5956   Class :character  
 Median : 7637   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7473   52009  :   538   Mean   : 7457                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   444   Max.   :13117                     
                 (Other):233556                                     
     TRIPS          FlowNoIntra          offset              dist      
 Min.   :    1.0   Min.   :    0.0   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1.000000   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1.000000   Median : 3748  
 Mean   :  109.5   Mean   :  109.4   Mean   :0.995475   Mean   : 4865  
 3rd Qu.:   58.0   3rd Qu.:   58.0   3rd Qu.:1.000000   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1.000000   Max.   :24758  
                                                                       
         DESTIN_GEO      SCHOOL_COUNT     BUSINESS_COUNT   ENTERTN_COUNT    
 POLYGON      :236683   Min.   :0.00000   Min.   : 0.000   Min.   :0.00000  
 epsg:3414    :     0   1st Qu.:0.00000   1st Qu.: 0.000   1st Qu.:0.00000  
 +proj=tmer...:     0   Median :0.00000   Median : 0.000   Median :0.00000  
                        Mean   :0.09973   Mean   : 1.445   Mean   :0.09812  
                        3rd Qu.:0.00000   3rd Qu.: 1.000   3rd Qu.:0.00000  
                        Max.   :4.00000   Max.   :40.000   Max.   :8.00000  
                                                                            
   FNB_COUNT         FS_COUNT     
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 0.000   Median : 0.000  
 Mean   : 1.521   Mean   : 2.729  
 3rd Qu.: 0.000   3rd Qu.: 3.000  
 Max.   :96.000   Max.   :68.000  
                                  

The print report above reveals that variables consist of 0 values.

BusStop_Trips3$ENTERTN_COUNT <- ifelse(
  BusStop_Trips3$ENTERTN_COUNT == 0,
  0.99, BusStop_Trips3$ENTERTN_COUNT)
BusStop_Trips3$BUSINESS_COUNT <- ifelse(
  BusStop_Trips3$BUSINESS_COUNT == 0,
  0.99, BusStop_Trips3$BUSINESS_COUNT)
BusStop_Trips3$SCHOOL_COUNT<- ifelse(
  BusStop_Trips3$SCHOOL_COUNT == 0,
  0.99, BusStop_Trips3$SCHOOL_COUNT)
BusStop_Trips3$FNB_COUNT <- ifelse(
  BusStop_Trips3$FNB_COUNT == 0,
  0.99, BusStop_Trips3$FNB_COUNT)
BusStop_Trips3$FS_COUNT <- ifelse(
  BusStop_Trips3$FS_COUNT == 0,
  0.99, BusStop_Trips3$FS_COUNT)
summary(BusStop_Trips3)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:236683     
 1st Qu.: 5846   84009  :   564   1st Qu.: 5956   Class :character  
 Median : 7637   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7473   52009  :   538   Mean   : 7457                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   444   Max.   :13117                     
                 (Other):233556                                     
     TRIPS          FlowNoIntra          offset              dist      
 Min.   :    1.0   Min.   :    0.0   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1.000000   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1.000000   Median : 3748  
 Mean   :  109.5   Mean   :  109.4   Mean   :0.995475   Mean   : 4865  
 3rd Qu.:   58.0   3rd Qu.:   58.0   3rd Qu.:1.000000   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1.000000   Max.   :24758  
                                                                       
         DESTIN_GEO      SCHOOL_COUNT   BUSINESS_COUNT   ENTERTN_COUNT  
 POLYGON      :236683   Min.   :0.990   Min.   : 0.990   Min.   :0.990  
 epsg:3414    :     0   1st Qu.:0.990   1st Qu.: 0.990   1st Qu.:0.990  
 +proj=tmer...:     0   Median :0.990   Median : 0.990   Median :0.990  
                        Mean   :1.003   Mean   : 2.116   Mean   :1.029  
                        3rd Qu.:0.990   3rd Qu.: 1.000   3rd Qu.:0.990  
                        Max.   :4.000   Max.   :40.000   Max.   :8.000  
                                                                        
   FNB_COUNT         FS_COUNT     
 Min.   : 0.990   Min.   : 0.990  
 1st Qu.: 0.990   1st Qu.: 0.990  
 Median : 0.990   Median : 0.990  
 Mean   : 2.282   Mean   : 3.263  
 3rd Qu.: 0.990   3rd Qu.: 3.000  
 Max.   :96.000   Max.   :68.000  
                                  

Origin (Production) constrained SIM

For the purpose of this exercise where we just want data within the specific hexagon areas, we will drop the NA data, and remember to remove the those INTRAZONAL

flow_data_attrc <- BusStop_Trips3 %>%
  filter(FlowNoIntra>0) %>%
  drop_na() 
summary(flow_data_attrc)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   22009  :   591   Min.   :   52   Length:235612     
 1st Qu.: 5866   84009  :   563   1st Qu.: 5956   Class :character  
 Median : 7639   75009  :   539   Median : 7602   Mode  :character  
 Mean   : 7475   52009  :   538   Mean   : 7459                     
 3rd Qu.: 9087   64009  :   451   3rd Qu.: 9035                     
 Max.   :13117   46009  :   443   Max.   :13117                     
                 (Other):232487                                     
     TRIPS          FlowNoIntra          offset       dist      
 Min.   :    1.0   Min.   :    1.0   Min.   :1   Min.   :  325  
 1st Qu.:    3.0   1st Qu.:    3.0   1st Qu.:1   1st Qu.: 1810  
 Median :   15.0   Median :   15.0   Median :1   Median : 3748  
 Mean   :  109.9   Mean   :  109.9   Mean   :1   Mean   : 4887  
 3rd Qu.:   59.0   3rd Qu.:   59.0   3rd Qu.:1   3rd Qu.: 6917  
 Max.   :96630.0   Max.   :96630.0   Max.   :1   Max.   :24758  
                                                                
         DESTIN_GEO      SCHOOL_COUNT   BUSINESS_COUNT   ENTERTN_COUNT 
 POLYGON      :235612   Min.   :0.990   Min.   : 0.990   Min.   :0.99  
 epsg:3414    :     0   1st Qu.:0.990   1st Qu.: 0.990   1st Qu.:0.99  
 +proj=tmer...:     0   Median :0.990   Median : 0.990   Median :0.99  
                        Mean   :1.003   Mean   : 2.119   Mean   :1.03  
                        3rd Qu.:0.990   3rd Qu.: 1.000   3rd Qu.:0.99  
                        Max.   :4.000   Max.   :40.000   Max.   :8.00  
                                                                       
   FNB_COUNT         FS_COUNT     
 Min.   : 0.990   Min.   : 0.990  
 1st Qu.: 0.990   1st Qu.: 0.990  
 Median : 0.990   Median : 0.990  
 Mean   : 2.286   Mean   : 3.266  
 3rd Qu.: 0.990   3rd Qu.: 3.000  
 Max.   :96.000   Max.   :68.000  
                                  

Unconstrained Spatial Interaction Model

uncSIM <- glm(formula = TRIPS ~ 
                log(SCHOOL_COUNT) + 
                log(BUSINESS_COUNT) +
                log(ENTERTN_COUNT) +
                log(FNB_COUNT) +
                log(FS_COUNT) +
                log(dist),
              family = poisson(link = "log"),
              data = flow_data_attrc,
              na.action = na.exclude)
uncSIM

Call:  glm(formula = TRIPS ~ log(SCHOOL_COUNT) + log(BUSINESS_COUNT) + 
    log(ENTERTN_COUNT) + log(FNB_COUNT) + log(FS_COUNT) + log(dist), 
    family = poisson(link = "log"), data = flow_data_attrc, na.action = na.exclude)

Coefficients:
        (Intercept)    log(SCHOOL_COUNT)  log(BUSINESS_COUNT)  
           10.43605              0.65780              0.07793  
 log(ENTERTN_COUNT)       log(FNB_COUNT)        log(FS_COUNT)  
            0.07513             -0.38256              0.45108  
          log(dist)  
           -0.76930  

Degrees of Freedom: 235611 Total (i.e. Null);  235605 Residual
Null Deviance:      102600000 
Residual Deviance: 83850000     AIC: 84940000

The results show that FnB is weak, which makes sense as people may not go far for FNB in early morning peak.

In this case i have then removed the FNB_COUNT, ENTERTN_count and rerun the following

uncSIM <- glm(formula = TRIPS ~ 
                log(SCHOOL_COUNT) + 
                log(BUSINESS_COUNT) +
                log(FS_COUNT) +
                log(dist), 
              family = poisson(link = "log"),
              data = flow_data_attrc,
              na.action = na.exclude)
uncSIM

Call:  glm(formula = TRIPS ~ log(SCHOOL_COUNT) + log(BUSINESS_COUNT) + 
    log(FS_COUNT) + log(dist), family = poisson(link = "log"), 
    data = flow_data_attrc, na.action = na.exclude)

Coefficients:
        (Intercept)    log(SCHOOL_COUNT)  log(BUSINESS_COUNT)  
           10.57926              0.62138             -0.02018  
      log(FS_COUNT)            log(dist)  
            0.34559             -0.78523  

Degrees of Freedom: 235611 Total (i.e. Null);  235607 Residual
Null Deviance:      102600000 
Residual Deviance: 84950000     AIC: 86030000

R-squared function

CalcRSquared <- function(observed,estimated){
  r <- cor(observed,estimated)
  R2 <- r^2
  R2
}
CalcRSquared(uncSIM$data$TRIPS, uncSIM$fitted.values)
[1] 0.02424408
r2_mcfadden(uncSIM)
# R2 for Generalized Linear Regression
       R2: 0.170
  adj. R2: 0.170

Origin (Production) constrained SIM

orcSIM <- glm(formula = TRIPS ~ 
                ORIGIN_GRID +
                log(SCHOOL_COUNT) + 
                log(BUSINESS_COUNT) +
                log(FS_COUNT) +
                log(dist)-1, 
              family = poisson(link = "log"),
              data = flow_data_attrc,
              na.action = na.exclude)
summary(orcSIM)

Call:
glm(formula = TRIPS ~ ORIGIN_GRID + log(SCHOOL_COUNT) + log(BUSINESS_COUNT) + 
    log(FS_COUNT) + log(dist) - 1, family = poisson(link = "log"), 
    data = flow_data_attrc, na.action = na.exclude)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
ORIGIN_GRID          1.536e-04  9.392e-08  1635.3   <2e-16 ***
log(SCHOOL_COUNT)    5.168e-01  2.156e-03   239.7   <2e-16 ***
log(BUSINESS_COUNT) -9.329e-02  2.931e-04  -318.3   <2e-16 ***
log(FS_COUNT)        3.246e-01  1.997e-04  1625.4   <2e-16 ***
log(dist)            3.792e-01  9.768e-05  3882.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 294586982  on 235612  degrees of freedom
Residual deviance: 124835760  on 235607  degrees of freedom
AIC: 125921844

Number of Fisher Scoring iterations: 8

We can examine how the constraints hold for destinations this time

CalcRSquared(orcSIM$data$TRIPS, orcSIM$fitted.values)
[1] 0.003086618

Destination constrained

decSIM <- glm(formula = TRIPS ~ 
                DESTIN_GRID +                 
                log(SCHOOL_COUNT) + 
                log(BUSINESS_COUNT) +
                log(FS_COUNT) +
                log(dist) -1, 
              family = poisson(link = "log"),
              data = flow_data_attrc,

              na.action = na.exclude)
summary(decSIM)

Call:
glm(formula = TRIPS ~ DESTIN_GRID + log(SCHOOL_COUNT) + log(BUSINESS_COUNT) + 
    log(FS_COUNT) + log(dist) - 1, family = poisson(link = "log"), 
    data = flow_data_attrc, na.action = na.exclude)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
DESTIN_GRID          1.757e-04  9.650e-08  1820.5   <2e-16 ***
log(SCHOOL_COUNT)    4.719e-01  2.154e-03   219.1   <2e-16 ***
log(BUSINESS_COUNT) -9.958e-02  2.922e-04  -340.8   <2e-16 ***
log(FS_COUNT)        3.360e-01  2.010e-04  1671.4   <2e-16 ***
log(dist)            3.579e-01  1.016e-04  3523.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 294586982  on 235612  degrees of freedom
Residual deviance: 124222147  on 235607  degrees of freedom
AIC: 125308231

Number of Fisher Scoring iterations: 8

We can examine how the constraints hold for destinations this time.

CalcRSquared(decSIM$data$TRIPS, decSIM$fitted.values)
[1] 0.002440099

Doubly constrained

dbcSIM <- glm(formula = TRIPS ~ 
                ORIGIN_GRID + 
                DESTIN_GRID + 
                log(dist),
              family = poisson(link = "log"),
              data = flow_data_attrc,
              na.action = na.exclude)
summary(dbcSIM)

Call:
glm(formula = TRIPS ~ ORIGIN_GRID + DESTIN_GRID + log(dist), 
    family = poisson(link = "log"), data = flow_data_attrc, na.action = na.exclude)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.073e+01  1.537e-03  6981.4   <2e-16 ***
ORIGIN_GRID -6.525e-05  2.204e-07  -296.0   <2e-16 ***
DESTIN_GRID  3.465e-05  2.210e-07   156.8   <2e-16 ***
log(dist)   -7.463e-01  1.947e-04 -3832.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 102582447  on 235611  degrees of freedom
Residual deviance:  87803733  on 235608  degrees of freedom
AIC: 88889815

Number of Fisher Scoring iterations: 7

Model comparison

Using compare_performance() of performance package

model_list <- list(unconstrained=uncSIM,
                   originConstrained=orcSIM,
                   destinationConstrained=decSIM,
                   doublyConstrained=dbcSIM)
compare_performance(model_list,
                    metrics = "RMSE")
# Comparison of Model Performance Indices

Name                   | Model |    RMSE
----------------------------------------
unconstrained          |   glm | 626.384
originConstrained      |   glm | 640.759
destinationConstrained |   glm | 640.312
doublyConstrained      |   glm | 629.339

This compute the RMSE of all the models in model_list file.

The print above reveals that all the SIMs are of similar value.

Visualising fitted

We will extract the fitted values from each model, join the values to flow_data_attrc data frame, for all the models.

df <- as.data.frame(uncSIM$fitted.values) %>%
  round(digits = 0) 
flow_data_attrc <- flow_data_attrc %>% 
  cbind(df) %>% 
  rename(uncTRIPS = "uncSIM.fitted.values")
df <- as.data.frame(orcSIM$fitted.values) %>% 
  round(digits = 0) 

flow_data_attrc <- flow_data_attrc %>% 
  cbind(df) %>% 
  rename(orcTRIPS = "orcSIM.fitted.values")
df <- as.data.frame(decSIM$fitted.values) %>% 
  round(digits = 0) 

flow_data_attrc <- flow_data_attrc %>% 
  cbind(df) %>% 
  rename(decTRIPS = "decSIM.fitted.values")
df <- as.data.frame(dbcSIM$fitted.values) %>% 
  round(digits = 0) 

flow_data_attrc <- flow_data_attrc %>% 
  cbind(df) %>% 
  rename(dbcTRIPS = "dbcSIM.fitted.values")

Now we plot it:

unc_p <- ggplot(data = flow_data_attrc,
                aes(x = uncTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm)

orc_p <- ggplot(data = flow_data_attrc,
                aes(x = orcTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm)

dec_p <- ggplot(data = flow_data_attrc,
                aes(x = decTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm)

dbc_p <- ggplot(data = flow_data_attrc,
                aes(x = dbcTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm)

ggarrange(unc_p, orc_p, dec_p, dbc_p,
          ncol = 2,
          nrow = 2)

From the figures above,

We can see that School, Business and Financial Services are key major driver for attractive variable when it comes to a single destination constrain, and the single origin constrain.

For destination constraint, It shows the movement from multiple origins to specific destinations (likely to key areas such as MRT interchange, work places, etc). This helps us to consider the constraints at the destination points, ensuring efficient planning of routes to cater to these flow during morning peak.that goods, services, or people reach these specific endpoints efficiently.

For origin constraints, it shows movement from specific origins to multiple potential destinations. We can defer from my previous analysis (at the very start of this exercise) that the single origins are key ares such as mrt interchange, the woodland checkpoints, and certain key residential areas with high population density. This helps us to considers constraints primarily at the origins, ensuring that resources and routes are planned optimally from these starting points to various possible endpoints.

Future Work - Train Station

Due to time contraints I am unable to complete the whole analysis using train station / MRT, it is actually interesting to link this to find out the OD flow of weekday morning peak to the school, since the data on school information does have the respective MRT stations nearby. I have done abit of data cleaning and wraggling but couldnt havecomplete the analysis. Will look into this in the future when have some free time (after this course, sadly).

TrainStation <- st_read(dsn="data/geospatial",                   
                        layer="RapidTransitSystemStation")%>%   
  st_transform(crs = 3414)
Reading layer `RapidTransitSystemStation' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21

I see that TrainStation is a Polygon geometry SF with “TYP_CD”, “STN_NAM”, “TYP_CD_DES”, “STN_NAM_DE”, and its geometry.

Note that “TYP_CD” is of all 0 value, “STN_NAM” is of all NA value, while “TYP_CD_DES” value is “MRT” or “LRT”.

Also note that the above shows a warning message:

Warning: GDAL Message 1: Non closed ring detected. To avoid accepting it, set the OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
invalid_geoms <- TrainStation[!st_is_valid(TrainStation), ]

It appears to have 3 invalid geoms, apart form the NA entry, we can see that the both “HARBOURFRONT MRT STATION” and “UPPER THOMSON MRT STATION” are the invalid geometries.

For the purpose of this exercise, I will filter out these entries and retain on the valid geometries in TrainStation.

TrainStation <- TrainStation[st_is_valid(TrainStation), ]%>%   
  st_transform(crs = 3414)
TrainStation_Exit <- st_read(dsn="data/geospatial",                 
                             layer="Train_Station_Exit_Layer")%>%   
  st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21

I see that TrainStation_Exit is a Point geometry SF with “stn_name”, “exit_code”, and its geometry.

I get aspatial Data of PASSENGER VOLUME BY ORIGIN DESTINATION TRAIN STATIONS, downloaded via API (postman GET) from Data Mall LTA. For the purpose of this exercise the Aug 2023 Data will be used.

OD_train <- read_csv("data/aspatial/origin_destination_train_202308.csv") #non-spatial data with no geometry features
OD_train$ORIGIN_PT_CODE <- as.factor(OD_train$ORIGIN_PT_CODE)
OD_train$DESTINATION_PT_CODE <- as.factor(OD_train$DESTINATION_PT_CODE)

Noted that both the aspatial data, for bus and train, are similar where there are total of 7 columns YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE, TIME_PER_HOUR, TOTAL_TRIPS.

We will now first combine the SF data sources relating to Bus Stop, using st_intersection():

TrainStation_combined <- st_intersection(TrainStation, MPSZ)

Before proceeding it will be good practice to check the crs of each using st_crs.

We will now first combine the SF data sources relating to Train Station, using st_intersection():

TrainStation_combined <- st_intersection(TrainStation_combined, TrainStation_Exit) %>%
  mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "MRT STATION", "")) %>%
  mutate(STN_NAM_DE = str_replace(STN_NAM_DE, "LRT STATION", ""))%>%
  select(3:13) %>%
  mutate(STN_NAM_DE = str_trim(STN_NAM_DE))

Note this data frame does not include the two MRT stations we filtered out earlier.

Note that we have removed the first two column (NA and 0 values), in order to associate the train station code, we will import the following data set and join them:

TrainStation_Code <- readxl::read_excel("data/aspatial/Train Station Codes and Chinese Names.xls")

TrainStation_Code$mrt_station_english <- toupper(TrainStation_Code$mrt_station_english)
TrainStation_combined <- left_join(TrainStation_combined, 
                              TrainStation_Code, 
                              by = c("STN_NAM_DE" = "mrt_station_english"))

For reproducibility, will save into a rds file:

write_rds(TrainStation_combined, "data/rds/TrainStation_combined.csv")  

Next, we are going to append the planning subzone code from TrainStation_combined data frame onto OD_train data frame.

Origin_TrainStation <- left_join(OD_train , TrainStation_combined,
            by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
  select(1:11,16) 

Looking at the results, noted there are a number of mismatches due to the origin PT code having multiple values. We will need to look into this and clean up the dataset for better visualiztion.

OD_train_filter <- OD_train %>%
  mutate(ORIGIN_PT_CODE = str_replace(ORIGIN_PT_CODE, "/.*", "")) %>%
  mutate(DESTINATION_PT_CODE = str_replace(DESTINATION_PT_CODE, "/.*", ""))

This removes the additional train codes with multiple lines (e.g. Botanic Garden is both Circle line and Downtown line), and only retain one, so to avoid double counting. Then we run the left_join using this dataframe:

Origin_TrainStation_filter <- left_join(OD_train_filter , TrainStation_combined,
            by = c("ORIGIN_PT_CODE" = "stn_code")) %>%
  select(1:11,16) 

Own Notes

Constructing an O/D Matrix

  • In O/D matrix, the Ti sum of row representing total output of origin location, while sum of column represents input of destination.

  • Can have sub-matrix

  • Additional new activity may change this to new structure

  • O/D Matrix can be costly, 322 x 322 = 110k O/D pairs, which each info need to be carefully provided (but may also change).

  • GPS/ Smart card can collect personal info to represent flows between locations

There are 3 Spatial Interaction Models (First 2 steps)

  • Basic assumption is that function of attributes of location of origin and destination and frictions

  • Potential Model is usually for measuring accessibility

  • Retail model usually used for franchise to choose service area of store of delivery segment

  • This course we will focus on Gravity Model which is the most common model. It uses Newton first law of gravity. Estimated Tij is transition/trip or flow between origin i (row) and destination j(columns). Parameters V, W, d, k lamda, alpha and beta. Beta is always assumed to be negative as increase in cost.distance will likely decrease the interaction.

  • A family of gravity Models - Unconstrained (totally constrained), origin constrained, destination constrained, doubly constrained.

We can see that sp is in list, no geometric column, they are segregated as different table inside object, In tidyverse it is in a whole table. But to call a field in sp, we will need to write something like below

  select(mpsz@data$SUBZONE) 

(Work done on Previous Hands-On Exercise Version)

Note that these were done on the previous version of Hands-On Exercise before the updates, just FYI so I remember I did this previously.

Assembling VARIABLES

Next moving on to the assemblling at least three propulsive and three attractiveness variables by using aspatial and geospatial from publicly available sources.

We will look at the following:

  1. Population of Age 7 to 12

  2. School (Primary School) Types

  3. School distinctive programme

  4. HDB Information - Occupancy rate

  5. MPSZ

pop <- read_csv("data/aspatial/pop.csv")

We will focus on age 7 to 12 therefore will remove the other data set, and remove those zero values.

pop <- pop %>%   
  select(1:3) %>%   
  filter(AGE7_12>0)   

This will gives us the PA, SZ, and Total number of Age7_12.

Next we look into the primary school

school <- read_csv("data/aspatial/Generalinformationofschools.csv")

Note we dont need so many different columns, and we only want to look at primary school:

school <- school %>%   
  filter(mainlevel_code == "PRIMARY") %>%   
  select(1,3:4,10:11,19:31)
write_rds(school, "data/rds/PrimarySchool.csv") 

next will look into the vairous school distinctive programme

distinctive <- read_csv("data/aspatial/SchoolDistinctiveProgrammes.csv")

we will just look into the alp_domain for all primary school

table(distinctive$alp_domain)

                 Aesthetics Business & Entrepreneurship 
                          7                           5 
                     Coding                  Humanities 
                         11                           3 
                ICT & Media     Innovation & Enterprise 
                          8                           6 
          Interdisciplinary                   Languages 
                         40                          34 
     Languages & Humanities                 Mathematics 
                         19                           2 
                       NULL                     Science 
                         55                          27 
                       STEM 
                         71 

Combining this to our primary school data

primary_school <- left_join(school,distinctive) 

There are three school without the corresponding distinctive, this is likely due to the primary school is new, we will confirm on this later. The school database is updated Nov 2023, while the data in distinctive was updated in 2021.

The few distinctive programme will serve as key variables in this dataset.

Next we will go back to pop data set:

pop <- left_join(pop, MPSZ,                   
                 by =c("SZ" = "SUBZONE_N"))

We do the same to the primary school

primary_school_MSPZ <- left_join(primary_school, MPSZ,                              by =c("dgp_code" = "PLN_AREA_N"))

Noted that the data got NA for those dgp_code = SENG KANG, due to discrepencies in format. Therefore we will amend this and rerun

primary_school$dgp_code[primary_school$dgp_code == "SENG KANG"] = "SENGKANG"
primary_school_MSPZ <- left_join(primary_school, MPSZ,                              by =c("dgp_code" = "PLN_AREA_N"))%>%   
  select(1,4:5,26:27,8,19:23,30)

Now we have a primary_school_MSPZ data set that show each primary school, their distinctive programme, and the associated SZ location,and the pop data set that has population aged 7 to 12 by SZ location.

write_rds(primary_school_MSPZ, "data/rds/primary_school_MSPZ.csv") 

We will look into this later.

Preparing Origin and Destination Attributes

Preparing origin attribute

pop <- read_csv("data/aspatial/pop.csv")
pop <- pop %>%   
  left_join(MPSZ,             
            by = c("PA" = "PLN_AREA_N",                    
                   "SZ" = "SUBZONE_N")) %>%   
  select(1:6) %>%   
  rename(SZ_NAME = SZ,          
         SZ = SUBZONE_C)
mpsz_hexagon <- st_intersection(BusStop_hexagon_sf, MPSZ) %>%   
  drop_na()
flow_data <- left_join(BusStop_Trips1,mpsz_hexagon,                         
                       by =c("ORIGIN_GRID" = "grid_id")) %>%   
  rename(ORIGIN_SZ = SUBZONE_C,          
         ORIGIN_SZ_NAME = SUBZONE_N)    
flow_data <- unique(flow_data)
flow_data <- left_join(flow_data,mpsz_hexagon,                         
                       by =c("DESTIN_GRID" = "grid_id")) %>%   
  rename(DESTIN_SZ = SUBZONE_C,          
         DESTIN_SZ_NAME = SUBZONE_N)    
flow_data <- unique(flow_data)
flow_data1 <- flow_data %>%   
  left_join(pop,             
            by = c(ORIGIN_SZ = "SZ")) %>%   
  rename(ORIGIN_AGE7_12 = AGE7_12,          
         ORIGIN_AGE13_24 = AGE13_24,          
         ORIGIN_AGE25_64 = AGE25_64) %>%   
  select(-c(PA, SZ_NAME))
flow_data1 <- flow_data1 %>%   
  left_join(pop,             
            by = c(DESTIN_SZ = "SZ")) %>%   
  rename(DESTIN_AGE7_12 = AGE7_12,          
         DESTIN_AGE13_24 = AGE13_24,          
         DESTIN_AGE25_64 = AGE25_64) %>%   
  select(-c(PA, SZ_NAME))
flow_data1 <- unique(flow_data1)
write_rds(flow_data1, "data/rds/SIM_data")

Calibrating Spatial Interaction Models

Visualising the dependent variable (TRIPS)

ggplot(data = flow_data1,        
       aes(x = TRIPS)) +   
  geom_histogram()

Notice that the distribution is highly skewed and not resemble bell shape or also known as normal distribution.

Next, let us visualise the relation between the dependent variable and one of the key independent variable in Spatial Interaction Model, namely distance.

ggplot(data = flow_data1,                
       aes(x = dist,                        
           y = TRIPS)) +      
  geom_point() +      
  geom_smooth(method = lm)

Notice that their relationship hardly resemble linear relationship.

On the other hand, if we plot the scatter plot by using the log transformed version of both variables, we can see that their relationship is more resemble linear relationship.

ggplot(data = flow_data1,                
       aes(x = log(dist),                        
           y = log(TRIPS))) +      
  geom_point() +      
  geom_smooth(method = lm)

Checking for variables with zero values

Since Poisson Regression is based of log and log 0 is undefined, it is important for us to ensure that no 0 values in the explanatory variables.

In the code chunk below, summary() of Base R is used to compute the summary statistics of all variables:

summary(flow_data1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   54009  :  3448   Min.   :   52   Length:985230     
 1st Qu.: 5916   84009  :  3132   1st Qu.: 6056   Class :character  
 Median : 7473   46009  :  2565   Median : 7494   Mode  :character  
 Mean   : 7293   17009  :  2349   Mean   : 7287                     
 3rd Qu.: 8651   60179  :  2345   3rd Qu.: 8517                     
 Max.   :13117   22009  :  2242   Max.   :13117                     
                 (Other):969149                                     
     TRIPS           FlowNoIntra           offset              dist      
 Min.   :    1.00   Min.   :    0.00   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.00   1st Qu.:    3.00   1st Qu.:1.000000   1st Qu.: 2030  
 Median :   14.00   Median :   14.00   Median :1.000000   Median : 3994  
 Mean   :   97.22   Mean   :   97.05   Mean   :0.995711   Mean   : 5027  
 3rd Qu.:   53.00   3rd Qu.:   53.00   3rd Qu.:1.000000   3rd Qu.: 7083  
 Max.   :96630.00   Max.   :96630.00   Max.   :1.000000   Max.   :24758  
                                                                         
 SCHOOL_COUNT.x    ORIGIN_SZ_NAME      ORIGIN_SZ         PLN_AREA_N.x      
 Min.   :0.00000   Length:985230      Length:985230      Length:985230     
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.09971                                                           
 3rd Qu.:0.00000                                                           
 Max.   :4.00000                                                           
 NA's   :27                                                                
 PLN_AREA_C.x        REGION_N.x         REGION_C.x       
 Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
              geometry.x     SCHOOL_COUNT.y    DESTIN_SZ_NAME    
 GEOMETRYCOLLECTION:    27   Min.   :0.00000   Length:985230     
 MULTIPOLYGON      :  5713   1st Qu.:0.00000   Class :character  
 POLYGON           :979490   Median :0.00000   Mode  :character  
 epsg:3414         :     0   Mean   :0.09136                     
 +proj=tmer...     :     0   3rd Qu.:0.00000                     
                             Max.   :4.00000                     
                             NA's   :14                          
  DESTIN_SZ         PLN_AREA_N.y       PLN_AREA_C.y        REGION_N.y       
 Length:985230      Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  REGION_C.y                     geometry.y     ORIGIN_AGE7_12 ORIGIN_AGE13_24
 Length:985230      GEOMETRYCOLLECTION:    14   Min.   :   0   Min.   :    0  
 Class :character   MULTIPOLYGON      :  6750   1st Qu.: 180   1st Qu.:  410  
 Mode  :character   POLYGON           :978466   Median : 730   Median : 1580  
                    epsg:3414         :     0   Mean   :1088   Mean   : 2434  
                    +proj=tmer...     :     0   3rd Qu.:1610   3rd Qu.: 3470  
                                                Max.   :6340   Max.   :16380  
                                                NA's   :27     NA's   :27     
 ORIGIN_AGE25_64 DESTIN_AGE7_12   DESTIN_AGE13_24 DESTIN_AGE25_64
 Min.   :    0   Min.   :   0.0   Min.   :    0   Min.   :    0  
 1st Qu.: 1860   1st Qu.:  90.0   1st Qu.:  170   1st Qu.:  900  
 Median : 7370   Median : 620.0   Median : 1300   Median : 6190  
 Mean   :11264   Mean   : 986.4   Mean   : 2200   Mean   :10207  
 3rd Qu.:16780   3rd Qu.:1480.0   3rd Qu.: 3290   3rd Qu.:15830  
 Max.   :74610   Max.   :6340.0   Max.   :16380   Max.   :74610  
 NA's   :27      NA's   :14       NA's   :14      NA's   :14     

The print report above reveals that variables ORIGIN_AGE7_12, ORIGIN_AGE13_24, ORIGIN_AGE25_64,DESTIN_AGE7_12, DESTIN_AGE13_24, DESTIN_AGE25_64 consist of 0 values.

flow_data1$DESTIN_AGE7_12 <- ifelse(   
  flow_data1$DESTIN_AGE7_12 == 0,   
  0.99, flow_data1$DESTIN_AGE7_12) 
flow_data1$DESTIN_AGE13_24 <- ifelse(   
  flow_data1$DESTIN_AGE13_24 == 0,   
  0.99, flow_data1$DESTIN_AGE13_24) 
flow_data1$DESTIN_AGE25_64 <- ifelse(   
  flow_data1$DESTIN_AGE25_64 == 0,   
  0.99, flow_data1$DESTIN_AGE25_64) 
flow_data1$ORIGIN_AGE7_12 <- ifelse(   
  flow_data1$ORIGIN_AGE7_12 == 0,   
  0.99, flow_data1$ORIGIN_AGE7_12) 
flow_data1$ORIGIN_AGE13_24 <- ifelse(   
  flow_data1$ORIGIN_AGE13_24 == 0,   
  0.99, flow_data1$ORIGIN_AGE13_24) 
flow_data1$ORIGIN_AGE25_64 <- ifelse(   
  flow_data1$ORIGIN_AGE25_64 == 0,   
  0.99, flow_data1$ORIGIN_AGE25_64)
summary(flow_data1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   54009  :  3448   Min.   :   52   Length:985230     
 1st Qu.: 5916   84009  :  3132   1st Qu.: 6056   Class :character  
 Median : 7473   46009  :  2565   Median : 7494   Mode  :character  
 Mean   : 7293   17009  :  2349   Mean   : 7287                     
 3rd Qu.: 8651   60179  :  2345   3rd Qu.: 8517                     
 Max.   :13117   22009  :  2242   Max.   :13117                     
                 (Other):969149                                     
     TRIPS           FlowNoIntra           offset              dist      
 Min.   :    1.00   Min.   :    0.00   Min.   :0.000001   Min.   :    0  
 1st Qu.:    3.00   1st Qu.:    3.00   1st Qu.:1.000000   1st Qu.: 2030  
 Median :   14.00   Median :   14.00   Median :1.000000   Median : 3994  
 Mean   :   97.22   Mean   :   97.05   Mean   :0.995711   Mean   : 5027  
 3rd Qu.:   53.00   3rd Qu.:   53.00   3rd Qu.:1.000000   3rd Qu.: 7083  
 Max.   :96630.00   Max.   :96630.00   Max.   :1.000000   Max.   :24758  
                                                                         
 SCHOOL_COUNT.x    ORIGIN_SZ_NAME      ORIGIN_SZ         PLN_AREA_N.x      
 Min.   :0.00000   Length:985230      Length:985230      Length:985230     
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.09971                                                           
 3rd Qu.:0.00000                                                           
 Max.   :4.00000                                                           
 NA's   :27                                                                
 PLN_AREA_C.x        REGION_N.x         REGION_C.x       
 Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
              geometry.x     SCHOOL_COUNT.y    DESTIN_SZ_NAME    
 GEOMETRYCOLLECTION:    27   Min.   :0.00000   Length:985230     
 MULTIPOLYGON      :  5713   1st Qu.:0.00000   Class :character  
 POLYGON           :979490   Median :0.00000   Mode  :character  
 epsg:3414         :     0   Mean   :0.09136                     
 +proj=tmer...     :     0   3rd Qu.:0.00000                     
                             Max.   :4.00000                     
                             NA's   :14                          
  DESTIN_SZ         PLN_AREA_N.y       PLN_AREA_C.y        REGION_N.y       
 Length:985230      Length:985230      Length:985230      Length:985230     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  REGION_C.y                     geometry.y     ORIGIN_AGE7_12   
 Length:985230      GEOMETRYCOLLECTION:    14   Min.   :   0.99  
 Class :character   MULTIPOLYGON      :  6750   1st Qu.: 180.00  
 Mode  :character   POLYGON           :978466   Median : 730.00  
                    epsg:3414         :     0   Mean   :1088.08  
                    +proj=tmer...     :     0   3rd Qu.:1610.00  
                                                Max.   :6340.00  
                                                NA's   :27       
 ORIGIN_AGE13_24    ORIGIN_AGE25_64    DESTIN_AGE7_12    DESTIN_AGE13_24   
 Min.   :    0.99   Min.   :    0.99   Min.   :   0.99   Min.   :    0.99  
 1st Qu.:  410.00   1st Qu.: 1860.00   1st Qu.:  90.00   1st Qu.:  170.00  
 Median : 1580.00   Median : 7370.00   Median : 620.00   Median : 1300.00  
 Mean   : 2434.32   Mean   :11264.18   Mean   : 986.60   Mean   : 2200.29  
 3rd Qu.: 3470.00   3rd Qu.:16780.00   3rd Qu.:1480.00   3rd Qu.: 3290.00  
 Max.   :16380.00   Max.   :74610.00   Max.   :6340.00   Max.   :16380.00  
 NA's   :27         NA's   :27         NA's   :14        NA's   :14        
 DESTIN_AGE25_64   
 Min.   :    0.99  
 1st Qu.:  900.00  
 Median : 6190.00  
 Mean   :10207.40  
 3rd Qu.:15830.00  
 Max.   :74610.00  
 NA's   :14        

Origin (Production) constrained SIM

from the summary table above we noted tehre are NA’s.

flow_data1[is.na(flow_data1$ORIGIN_AGE7_12),] 
# A tibble: 27 × 30
# Groups:   ORIGIN_GRID, ORIGIN_BS [3]
   ORIGIN_GRID ORIGIN_BS DESTIN_GRID DESTIN_BS TRIPS FlowNoIntra   offset  dist
         <int> <fct>           <int> <chr>     <dbl>       <dbl>    <dbl> <dbl>
 1        4224 46239            5078 46069         1           1 1        6344.
 2        4224 46239            5125 46088         2           2 1        6668.
 3        4224 46239            5125 46088         2           2 1        6668.
 4        4224 46239            5126 46109         9           9 1        6175 
 5        4224 46239            5033 46219      3381        3381 1        4585.
 6        5033 46211            5126 46109        14          14 1        1720.
 7        5033 46211            5033 46219         2           0 0.000001    0 
 8        5033 46211            4224 46239      5676        5676 1        4585.
 9        5033 46219            4825 44049         3           3 1        9030.
10        5033 46219            4825 44049         3           3 1        9030.
# ℹ 17 more rows
# ℹ 22 more variables: SCHOOL_COUNT.x <int>, ORIGIN_SZ_NAME <chr>,
#   ORIGIN_SZ <chr>, PLN_AREA_N.x <chr>, PLN_AREA_C.x <chr>, REGION_N.x <chr>,
#   REGION_C.x <chr>, geometry.x <GEOMETRYCOLLECTION [m]>,
#   SCHOOL_COUNT.y <int>, DESTIN_SZ_NAME <chr>, DESTIN_SZ <chr>,
#   PLN_AREA_N.y <chr>, PLN_AREA_C.y <chr>, REGION_N.y <chr>, REGION_C.y <chr>,
#   geometry.y <GEOMETRY [m]>, ORIGIN_AGE7_12 <dbl>, ORIGIN_AGE13_24 <dbl>, …

For the purpose of this exercise where we just want data within the specific hexagon areas, we will drop the NA data, and remember to remove the those whiten the same hexagon (dist = 0)

flow_data1 <- flow_data1 %>%   
  drop_na() %>%   
  filter(dist >0) 
summary(flow_data1)
  ORIGIN_GRID      ORIGIN_BS       DESTIN_GRID     DESTIN_BS        
 Min.   :   52   54009  :  3416   Min.   :   52   Length:980967     
 1st Qu.: 5920   84009  :  3123   1st Qu.: 6056   Class :character  
 Median : 7492   46009  :  2556   Median : 7494   Mode  :character  
 Mean   : 7295   60179  :  2345   Mean   : 7289                     
 3rd Qu.: 8651   17009  :  2331   3rd Qu.: 8517                     
 Max.   :13117   22009  :  2242   Max.   :13117                     
                 (Other):964954                                     
     TRIPS           FlowNoIntra           offset       dist      
 Min.   :    1.00   Min.   :    1.00   Min.   :1   Min.   :  325  
 1st Qu.:    3.00   1st Qu.:    3.00   1st Qu.:1   1st Qu.: 2030  
 Median :   14.00   Median :   14.00   Median :1   Median : 3994  
 Mean   :   97.29   Mean   :   97.29   Mean   :1   Mean   : 5049  
 3rd Qu.:   53.00   3rd Qu.:   53.00   3rd Qu.:1   3rd Qu.: 7128  
 Max.   :75452.00   Max.   :75452.00   Max.   :1   Max.   :24758  
                                                                  
 SCHOOL_COUNT.x    ORIGIN_SZ_NAME      ORIGIN_SZ         PLN_AREA_N.x      
 Min.   :0.00000   Length:980967      Length:980967      Length:980967     
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.09965                                                           
 3rd Qu.:0.00000                                                           
 Max.   :4.00000                                                           
                                                                           
 PLN_AREA_C.x        REGION_N.x         REGION_C.x       
 Length:980967      Length:980967      Length:980967     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
         geometry.x     SCHOOL_COUNT.y    DESTIN_SZ_NAME      DESTIN_SZ        
 MULTIPOLYGON :  5696   Min.   :0.00000   Length:980967      Length:980967     
 POLYGON      :975271   1st Qu.:0.00000   Class :character   Class :character  
 epsg:3414    :     0   Median :0.00000   Mode  :character   Mode  :character  
 +proj=tmer...:     0   Mean   :0.09126                                        
                        3rd Qu.:0.00000                                        
                        Max.   :4.00000                                        
                                                                               
 PLN_AREA_N.y       PLN_AREA_C.y        REGION_N.y         REGION_C.y       
 Length:980967      Length:980967      Length:980967      Length:980967     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
         geometry.y     ORIGIN_AGE7_12    ORIGIN_AGE13_24    ORIGIN_AGE25_64   
 MULTIPOLYGON :  6733   Min.   :   0.99   Min.   :    0.99   Min.   :    0.99  
 POLYGON      :974234   1st Qu.: 180.00   1st Qu.:  410.00   1st Qu.: 1860.00  
 epsg:3414    :     0   Median : 730.00   Median : 1580.00   Median : 7370.00  
 +proj=tmer...:     0   Mean   :1086.96   Mean   : 2431.61   Mean   :11253.33  
                        3rd Qu.:1610.00   3rd Qu.: 3470.00   3rd Qu.:16780.00  
                        Max.   :6340.00   Max.   :16380.00   Max.   :74610.00  
                                                                               
 DESTIN_AGE7_12    DESTIN_AGE13_24    DESTIN_AGE25_64   
 Min.   :   0.99   Min.   :    0.99   Min.   :    0.99  
 1st Qu.:  90.00   1st Qu.:  170.00   1st Qu.:  900.00  
 Median : 620.00   Median : 1300.00   Median : 6120.00  
 Mean   : 985.03   Mean   : 2196.56   Mean   :10191.93  
 3rd Qu.:1480.00   3rd Qu.: 3290.00   3rd Qu.:15830.00  
 Max.   :6340.00   Max.   :16380.00   Max.   :74610.00  
                                                        

Unconstrained Spatial Interaction Model

uncSIM <- glm(formula = TRIPS ~                  
                log(ORIGIN_AGE7_12) +                  
                log(DESTIN_AGE7_12) +                 
                log(dist),               
              family = poisson(link = "log"),               
              data = flow_data1,               
              na.action = na.exclude) 
uncSIM

Call:  glm(formula = TRIPS ~ log(ORIGIN_AGE7_12) + log(DESTIN_AGE7_12) + 
    log(dist), family = poisson(link = "log"), data = flow_data1, 
    na.action = na.exclude)

Coefficients:
        (Intercept)  log(ORIGIN_AGE7_12)  log(DESTIN_AGE7_12)  
           10.25035              0.09730             -0.01915  
          log(dist)  
           -0.78412  

Degrees of Freedom: 980966 Total (i.e. Null);  980963 Residual
Null Deviance:      370600000 
Residual Deviance: 308100000    AIC: 312500000

R-squared function

CalcRSquared <- function(observed,estimated){   
  r <- cor(observed,estimated)   
  R2 <- r^2   
  R2 }
CalcRSquared(uncSIM$data$TRIPS, uncSIM$fitted.values)
[1] 0.02258983
r2_mcfadden(uncSIM)
# R2 for Generalized Linear Regression
       R2: 0.167
  adj. R2: 0.167

Origin (Production) constrained SIM

orcSIM <- glm(formula = TRIPS ~                   
                ORIGIN_GRID +                  
                log(DESTIN_AGE7_12) +                  
                log(dist),               
            family = poisson(link = "log"),               
            data = flow_data1,               
            na.action = na.exclude) 
summary(orcSIM)

Call:
glm(formula = TRIPS ~ ORIGIN_GRID + log(DESTIN_AGE7_12) + log(dist), 
    family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.089e+01  8.407e-04 12959.7   <2e-16 ***
ORIGIN_GRID         -4.072e-05  4.995e-08  -815.2   <2e-16 ***
log(DESTIN_AGE7_12)  9.087e-03  3.737e-05   243.2   <2e-16 ***
log(dist)           -7.755e-01  1.017e-04 -7622.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 370603551  on 980966  degrees of freedom
Residual deviance: 312431320  on 980963  degrees of freedom
AIC: 316887141

Number of Fisher Scoring iterations: 7

We can examine how the constraints hold for destinations this time

CalcRSquared(orcSIM$data$TRIPS, orcSIM$fitted.values)
[1] 0.02050437

Destination constrained

decSIM <- glm(formula = TRIPS ~                  
                DESTIN_GRID +                  
                log(ORIGIN_AGE7_12) +                  
                log(dist),               
              family = poisson(link = "log"),              
              data = flow_data1,               
              na.action = na.exclude) 
summary(decSIM)

Call:
glm(formula = TRIPS ~ DESTIN_GRID + log(ORIGIN_AGE7_12) + log(dist), 
    family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.039e+01  8.519e-04 12195.4   <2e-16 ***
DESTIN_GRID         -4.864e-05  5.027e-08  -967.6   <2e-16 ***
log(ORIGIN_AGE7_12)  9.860e-02  4.542e-05  2170.7   <2e-16 ***
log(dist)           -7.713e-01  1.009e-04 -7640.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 370603551  on 980966  degrees of freedom
Residual deviance: 307400985  on 980963  degrees of freedom
AIC: 311856806

Number of Fisher Scoring iterations: 7

We can examine how the constraints hold for destinations this time.

CalcRSquared(decSIM$data$TRIPS, decSIM$fitted.values)
[1] 0.02321934

Doubly constrained

dbcSIM <- glm(formula = TRIPS ~                  
                ORIGIN_GRID +                  
                DESTIN_GRID +                  
                log(dist),               
              family = poisson(link = "log"),               
              data = flow_data1,               
              na.action = na.exclude) 
summary(dbcSIM)

Call:
glm(formula = TRIPS ~ ORIGIN_GRID + DESTIN_GRID + log(dist), 
    family = poisson(link = "log"), data = flow_data1, na.action = na.exclude)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.093e+01  8.178e-04 13368.8   <2e-16 ***
ORIGIN_GRID -7.140e-05  1.147e-07  -622.4   <2e-16 ***
DESTIN_GRID  3.658e-05  1.153e-07   317.2   <2e-16 ***
log(dist)   -7.797e-01  1.016e-04 -7676.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 370603551  on 980966  degrees of freedom
Residual deviance: 312390431  on 980963  degrees of freedom
AIC: 316846252

Number of Fisher Scoring iterations: 7

Model comparison

Using compare_performance() of performance package

model_list <- list(unconstrained=uncSIM,                    
                   originConstrained=orcSIM,                    
                   destinationConstrained=decSIM,                    
                   doublyConstrained=dbcSIM)
compare_performance(model_list,                     
                    metrics = "RMSE")
# Comparison of Model Performance Indices

Name                   | Model |    RMSE
----------------------------------------
unconstrained          |   glm | 520.042
originConstrained      |   glm | 520.589
destinationConstrained |   glm | 519.871
doublyConstrained      |   glm | 520.656

This compute the RMSE of all the models in model_list file.

The print above reveals that all the SIMs are of similar value.

Visualising fitted

We will extract the fitted values from each model, join the values to data_flow1 data frame, for all the models.

df <- as.data.frame(uncSIM$fitted.values) %>%   
  round(digits = 0)   
flow_data1 <- flow_data1 %>%    
  cbind(df) %>%    
  rename(uncTRIPS = "uncSIM$fitted.values")
df <- as.data.frame(orcSIM$fitted.values) %>%    
  round(digits = 0)   
flow_data1 <- flow_data1 %>%    
  cbind(df) %>%    
  rename(orcTRIPS = "orcSIM$fitted.values")
df <- as.data.frame(decSIM$fitted.values) %>%    
  round(digits = 0)   
flow_data1 <- flow_data1 %>%    
  cbind(df) %>%    
  rename(decTRIPS = "decSIM$fitted.values")
df <- as.data.frame(dbcSIM$fitted.values) %>%    
  round(digits = 0)   
flow_data1 <- flow_data1 %>%    
  cbind(df) %>%    
  rename(dbcTRIPS = "dbcSIM$fitted.values")

Now we plot it:

unc_p <- ggplot(data = flow_data1,                 
                aes(x = uncTRIPS,                     
                    y = TRIPS)) +   
  geom_point() +   geom_smooth(method = lm)  

orc_p <- ggplot(data = flow_data1,                 
                aes(x = orcTRIPS,                     
                    y = TRIPS)) +   
  geom_point() +   
  geom_smooth(method = lm)  

dec_p <- ggplot(data = flow_data1,                 
                aes(x = decTRIPS,                     
                    y = TRIPS)) +   
  geom_point() +   
  geom_smooth(method = lm)  

dbc_p <- ggplot(data = flow_data1,                 
                aes(x = dbcTRIPS,                     
                    y = TRIPS)) +   
  geom_point() +   geom_smooth(method = lm)  

ggarrange(unc_p, orc_p, dec_p, dbc_p,           
          ncol = 2,           
          nrow = 2)

From the above, we can draw the conclusion that there isnt much of a correlation between the age 7 to 12 population and number of trips across the hexagon areas O/D.